5.3 Check the packaging
Have you ever gotten a present before the time when you were allowed to open it? Sure, we all
have. The problem is that the present is wrapped, but you desperately want to know what’s
inside. What’s a person to do in those circumstances? Well, you can shake the box a bit, maybe
knock it with your knuckle to see if it makes a hollow sound, or even weigh it to see how
heavy it is. This is how you should think about your dataset before you start analyzing it for
real.
Assuming you don’t get any warnings or errors when reading in the dataset, you should now
have an object in your workspace named ozone. It’s usually a good idea to poke at that object a
little bit before you break open the wrapping paper.
For example, you can check the number of rows and columns.
> nrow(ozone)
[1] 7147884
> ncol(ozone)
[1] 23
Remember when I said there were 7,147,884 rows in the file? How does that match up with
what we’ve read in? This dataset also has relatively few columns, so you might be able to
open the original text file and check that the number of columns printed here (23) matches
the number of columns you see in that file.
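If opening the raw file by hand is impractical, the same comparison can be scripted. What follows is only a minimal sketch, assuming a comma-separated file with a single header line. The `path` variable and the tiny temporary file it points to are hypothetical stand-ins for the original text file, and counting fields across a multi-million-row file may take a little while.

```r
# Stand-in for the raw file: a tiny CSV written to a temporary location.
# In practice `path` would point at the original text file (hypothetical here).
path <- tempfile(fileext = ".csv")
writeLines(c('"State Code","County Code","Sample Measurement"',
             '"01","003",0.047',
             '"01","003",0.043'), path)

# count.fields() returns the number of fields found on each line of the file
fields <- count.fields(path, sep = ",")

n_cols_file <- fields[1]            # columns according to the header line
n_rows_file <- length(fields) - 1L  # data rows: all lines minus the header

# Every line should have the same number of fields as the header
stopifnot(all(fields == n_cols_file))

# Read the file and compare against what ended up in the workspace
dat <- read.csv(path, colClasses = "character")
stopifnot(nrow(dat) == n_rows_file, ncol(dat) == n_cols_file)
```

If either check fails, that is a hint that some lines were dropped, split, or misparsed during reading.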
5.4 Run str()
Another thing you can do is run str() on the dataset. This is usually a safe operation in the
sense that even with a very large dataset, running str() shouldn’t take too long.
> str(ozone)
Classes 'tbl_df', 'tbl' and 'data.frame': 7147884 obs. of 23 variables:
$ State.Code : chr "01" "01" "01" "01" ...
$ County.Code : chr "003" "003" "003" "003" ...
$ Site.Num : chr "0010" "0010" "0010" "0010" ...
$ Parameter.Code : chr "44201" "44201" "44201" "44201" ...
$ POC : int 1 1 1 1 1 1 1 1 1 1 ...
$ Latitude : num 30.5 30.5 30.5 30.5 30.5 ...
$ Longitude : num -87.9 -87.9 -87.9 -87.9 -87.9 ...
$ Datum : chr "NAD83" "NAD83" "NAD83" "NAD83" ...
$ Parameter.Name : chr "Ozone" "Ozone" "Ozone" "Ozone" ...
$ Date.Local : chr "2014-03-01" "2014-03-01" "2014-03-01" "2014-03-01" ...
$ Time.Local : chr "01:00" "02:00" "03:00" "04:00" ...
$ Date.GMT : chr "2014-03-01" "2014-03-01" "2014-03-01" "2014-03-01" ...
$ Time.GMT : chr "07:00" "08:00" "09:00" "10:00" ...
$ Sample.Measurement : num 0.047 0.047 0.043 0.038 0.035 0.035 0.034 0.037 0.044 0.046 ...
$ Units.of.Measure : chr "Parts per million" "Parts per million" "Parts per million" "Parts per million" ...
$ MDL : num 0.005 0.005 0.005 0.005 0.005 0.005 0.005 0.005 0.005 0.005 ...
$ Uncertainty : num NA NA NA NA NA NA NA NA NA NA ...
$ Qualifier : chr "" "" "" "" ...
$ Method.Type : chr "FEM" "FEM" "FEM" "FEM" ...
$ Method.Name : chr "INSTRUMENTAL - ULTRA VIOLET" "INSTRUMENTAL - ULTRA VIOLET" "INSTRUMENTAL - ULTRA VIOLET" "INSTRUMENTAL - ULTRA VIOLET" ...
$ State.Name : chr "Alabama" "Alabama" "Alabama" "Alabama" ...
$ County.Name : chr "Baldwin" "Baldwin" "Baldwin" "Baldwin" ...
$ Date.of.Last.Change: chr "2014-06-30" "2014-06-30" "2014-06-30" "2014-06-30" ...
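Besides sample values, str() reports the class of each column. When only the classes are of interest, they can also be pulled out directly. The sketch below uses a small made-up data frame called `toy` (a hypothetical stand-in, not the ozone data), with column names borrowed from the listing above.

```r
# Small stand-in data frame; column names mirror a few from the str() output.
toy <- data.frame(
  State.Code = c("01", "01"),
  POC        = c(1L, 1L),
  Latitude   = c(30.5, 30.5),
  stringsAsFactors = FALSE
)

# One class per column, returned as a named character vector
col_classes <- vapply(toy, function(x) class(x)[1], character(1))
print(col_classes)

# Tally of how many columns fall into each class
print(table(col_classes))
```

Applied to a real dataset, a tally like this makes it easy to spot a column that was read in with an unexpected class.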
The output of str() duplicates some information that we already have, like the number of
rows and columns. More importantly, you can examine the classes of each of the columns to